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Abstract 

Many  combinatorial  optimization  algorithms  have  no  mechanism  to  capture 
inter-parameter  dependencies.  However,  modeling  such  dependencies  may  allow 
an  algorithm  to  concentrate  its  sampling  more  effectively  on  regions  of  the  search 
space  which  have  appeared  promising  in  the  past.  We  present  an  algorithm  which 
incrementally  learns  second-order  probability  distributions  from  good  solutions 
seen  so  far,  uses  these  statistics  to  generate  optimal  (in  terms  of  maximum  likeli¬ 
hood)  dependency  trees  to  model  these  distributions,  and  then  stochastically  gen¬ 
erates  new  candidate  solutions  from  these  trees.  We  test  this  algorithm  on  a 
variety  of  optimization  problems.  Our  results  indicate  superior  performance  over 
other  tested  algorithms  that  either  (1)  do  not  explicitly  use  these  dependencies,  or 
(2)  use  these  dependencies  to  generate  a  more  restricted  class  of  dependency 
graphs. 
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1.  INTRODUCTION 


When  performing  combinatorial  optimization,  we  wish  to  concentrate  the 
generation  of  candidate  solutions  to  regions  of  the  solution  space  which  have  a 
high  probability  of  containing  solutions  better  than  those  previously  sampled. 
Most  optimization  algorithms  attempt  to  do  this  by  searching  around  the  location 
of  the  best  previous  solutions.  There  are  many  ways  to  do  this: 

Gradient  Descent/Simulated  Annealing:  Only  generate  candidate  solutions 
neighboring  a  single  current  solution.  These  algorithms  make  no  attempt  to 
model  the  dependence  of  the  solution  quality  upon  particular  solution  parameters. 

Genetic  Algorithms  (GAs):  attempt  to  implicitly  capture  dependencies  between 
parameters  and  the  solution  quality  by  concentrating  samples  on  combinations  of 
high-performance  members  of  the  current  population,  through  the  use  of  the 
crossover  operator.  Crossover  combines  the  information  contained  within  pairs 
of  selected  "parent"  solutions  by  placing  random  subsets  of  each  parent's  bits  into 
their  respective  positions  in  a  new  "child"  solution.  Note  that  no  explicit 
information  is  kept  about  which  groups  of  parameters  contributed  to  the  quality 
of  candidate  solutions.  Therefore,  the  crossover  is  randomized,  and  must  be 
performed  repeatedly,  as  most  of  the  crossovers  yield  unproductive  results. 

Population-Based  Incremental  Learning  (PBIL):  PBIL  explicitly  captures  the 
first-order  dependencies  between  individual  solution  parameters  and  solution 
quality.  PBIL  uses  a  vector  of  first-order  probabilities  to  model  a  simple 
probability  distribution  over  all  bit-strings.  The  probability  vector  is  adjusted  to 
increase  the  likelihood  of  generating  high-evaluation  solutions.  To  do  this,  as  the 
search  progresses,  the  probability  vector  is  moved  towards  the  highest- 
performing  individual  in  each  generation.  No  attempt  is  made  to  model  the  inter¬ 
parameter  dependencies.  However,  for  many  optimization  problems  drawn  from 
the  GA  literature,  the  use  of  these  statistics  alone  allows  PBIL  to  outperform 
standard  GAs  and  hill-climbers  [Baluja,  1997]. 

Mutual  Information  Maximization  for  Input  Clustering  (MIMIC):  This  work 
extends  PBIL  by  (I)  capturing  some  of  the  pair-wise  dependencies  between  the 
solution  parameters,  and  (2)  providing  a  statistically  meaningful  way  of  updating 
the  distribution  from  which  samples  are  generated.  MIMIC  maintains  at  least  the 
top  N%  of  all  previously  generated  solutions,  from  which  it  calculates  pair-wise 
conditional  probabilities.  MIMIC  uses  a  greedy  search  to  generate  a  chain  in 
which  each  variable  is  conditioned  on  the  previous  variable.  The  first  variable  in 
the  chain,  X^,  is  chosen  to  be  the  variable  with  the  lowest  unconditional  entropy 
H(Xj).  When  deciding  which  subsequent  variable  Xj^j  to  add  to  the  chain, 
MIMIC  selects  the  variable  with  the  lowest  conditional  entropy  H(Xj_^j  I  Xj). 
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After  creating  the  full  chain,  it  randomly  generates  more  samples  from  the 
distribution  specified  by  this  chain.  The  entire  process  is  then  repeated. 

The  work  presented  in  this  paper  extends  MIMIC.  Our  algorithm  maintains 
second-order  statistics  similar  to  those  employed  by  MIMIC.  However,  while 
MIMIC  was  restricted  to  a  greedy  heuristic  for  finding  chain-based  models,  our 
algorithm  uses  a  broader  class  of  models,  and  finds  the  optimal  model  within  that 
class.  The  algorithm  will  be  described  in  Section  2.  The  empirical  results 
described  in  Section  3  demonstrate  that,  in  most  cases  examined,  the  performance 
of  the  combinatorial  optimization  algorithms  consistently  improves  as  the 
accuracy  of  the  statistical  model  increases.  In  Section  4,  we  draw  our  conclusions 
and  discuss  avenues  of  future  research. 


2.  ALGORITHM 

We  wish  to  model  a  probability  distribution  P(Xi,  ...,  Xjj)  over  bit-strings  of 
length  n,  where  Xj,  ...,  X^^  are  variables  corresponding  to  the  values  of  the  bits. 
Suppose  we  restrict  our  model  P'(Xi, ...,  X^,)  to  models  of  the  following  form: 

n 

i  =  1 

Here,  m  =  (mj,  ...,  m^)  is  some  unknown  permutation  of  (1,  ...,  n);  p(i)  maps  the 
integers  0  <  i  <  n  to  integers  0  <  p(i)  <  i;  and  P(Xj  I  Xq)  is  by  definition  equal  to 
P(Xj)  for  all  i.  In  other  words,  we  restrict  P'  to  factorizations  in  which  the 
conditional  probability  distribution  for  any  one  bit  depends  on  the  value  of  at 
most  one  other  bit.  (In  Bayesian  network  terms,  this  means  we  are  restricting  our 
models  to  networks  in  which  each  node  can  have  at  most  one  parent.) 

A  method  for  finding  the  optimal  model  within  these  restrictions  is  presented  in 
[Chow  and  Liu,  1968].  A  complete  weighted  graph  G  is  created  in  which  every 
variable  Xj  is  represented  by  a  corresponding  vertex  Vj,  and  in  which  the  weight 
Wjj  for  the  edge  between  vertices  Vj  and  Vj  is  set  to  the  mutual  information 
I(Xj,Xj)  between  Xj  and  Xj.  The  edges  in  the  maximum  spanning  tree  of  G 
determine  an  optimal  set  of  (n-1)  first-order  conditional  probabilities  with  which 
to  model  the  original  probability  distribution.  Since  the  edges  in  G  are 
undirected,  a  decision  must  be  made  about  the  directionality  of  the  dependencies 
with  which  to  construct  P';  however,  all  such  orderings  conforming  to  Equation  1 
model  identical  distributions.  Among  all  trees,  this  algorithm  produces  the  tree 
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which  maximizes  the  likelihood  of  the  data  when  the  algorithm  is  applied  to 
empirical  observations  drawn  from  any  unknown  distribution. 

We  employ  a  version  of  this  algorithm  for  combinatorial  optimization  as 
follows.  We  incrementally  learn  second-order  statistics  from  previously  seen 
"good"  bit-strings;  using  a  version  of  Chow  and  Liu's  algorithm,  we  determine 
optimal  subsets  of  these  statistics  with  which  to  create  model  probability 
distributions  P'(Xj,  ...,  Xj^)  of  the  form  assumed  in  Equation  1.  These 
distributions  are  used  to  generate  new  candidate  bit-strings  which  are  then 
evaluated.  The  best  bit-strings  are  used  to  update  the  second-order  statistics; 
these  statistics  are  used  to  generate  another  dependency  tree;  and  so  forth,  until 
the  algorithm's  termination  criteria  are  met. 

Our  algorithm  maintains  an  array  A  containing  a  number  A[Xj=a,  for 

every  pair  of  variables  Xj  and  Xj  and  every  combination  of  binary  assignments  to 
a  and  b,  where  A[Xj=a,  ^j=b]  is  as  an  estimate  of  how  many  recently  generated 
"good"  bit  strings  have  had  bit  X  =a  and  bit  Xj=b.  Since  the  probability 
distribution  will  gradually  shift  towards  better  bit-strings,  we  put  more  weight  on 
more  recently  generated  bit-strings.  All  A[Xj=a,  ^]=b]  are  initialized  to  some 
constant  Cjj^j^  before  the  first  iteration  of  the  algorithm;  this  causes  the  algorithm's 
first  set  of  bit-strings  to  be  generated  from  the  uniform  distribution. 

We  calculate  first-  and  second-order  probabilities  P(Xj)  and  P(Xj,  Xj)  from  the 
current  values  of  A[Xj=a,  Xj=Z7].  From  these  probabilities  we  calculate  the  mutual 
information,  I(Xj,  Xj),  between  all  pairs  of  variables  Xj  and  Xj: 


I(X,Xp  =  l^P{X^  =  a,X.= 

a,  b 


P{X.  =  a,X.=  b) 

b)  ■  log - - - 

^P{X.  =  a)  ■  P{Xj  =  b) 


(2) 


We  then  create  a  dependency  tree  containing  an  optimum  set  of  n-1  first-order 
dependencies.  To  do  this,  first,  we  select  an  arbitrary  bit  Xj-^^j  to  place  at  the  root 
of  the  tree.  (In  our  implementation  we  select  the  bit  with  the  lowest  unconditional 
entropy,  in  order  to  make  it  more  comparable  with  MIMIC.)  Then,  we  add  all 
other  bits  to  the  tree  as  follows:  we  find  the  pair  of  bits  Xj^  and  X^^j  -  where  X^^^ 
is  any  bit  not  yet  in  the  tree,  and  Xjj^  is  any  bit  already  in  the  tree  -  with  the 
maximum  mutual  information  I(Xjjj,  Xq^).  We  add  X^^j  to  the  tree,  with  Xjjj  as 
its  parent,  and  repeat  this  process  until  all  the  bits  have  been  added  to  the  tree.  By 
keeping  track  of  which  bit  inside  the  tree  has  the  highest  mutual  information  with 
each  bit  still  outside  the  tree,  we  can  perform  the  entire  tree-growing  operation  in 

0(n  )  time.  Because  our  algorithm  is  a  variant  of  Prim's  algorithm  for  finding 
minimum  spanning  trees  [Prim,  1957],  it  can  easily  be  shown  that  it  constructs  a 
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tree  that  maximizes  the  sum 


(3) 

i  =  1 

which  in  turn  minimizes  the  Kullback-Leibler  divergence  D(PIIPO,  as  shown  in 
[Chow  &  Liu,  1968].  For  all  i,  our  algorithm  sets  m(i)  =  j  if  Xj  was  the  i*^^  bit 
added  to  the  tree,  and  sets  p(i)  =  j  if  the  j*  bit  to  be  added  to  the  tree  was  the 
parent  of  the  i*  bit  to  be  added.  (m(p(l))  is  Xq,  a  dummy  node  with  a  constant 
value.)  Our  modeled  probability  distribution  P'(Xi,  ...,  X^,)  is  then  specified  by 
Equation  1.  Among  all  distributions  of  the  same  form,  this  distribution 
maximizes  the  likelihood  of  the  data  when  the  data  is  a  set  of  empirical 
observations  drawn  from  any  unknown  distribution. 

Once  we  have  generated  a  model  dependency  tree  specifying  P'(Xi,  ...,  Xj^),  we 
use  it  to  generate  K  new  bit-strings.  Each  bit-string  is  generated  in  0(n)  time 
during  a  depth-first  traversal  of  the  tree,  and  then  evaluated;  the  best  M  are 
chosen  to  update  A.  Eirst,  we  multiply  all  entries  in  A  by  a  decay  factor,  a, 
between  0  and  1;  this  puts  more  weight  on  the  more  recently  generated  bit- 
strings.  Then,  we  add  1.0  to  A[Xj=a,  Xj=Z7]  for  every  bit-string  out  of  the  best  M 
in  which  bit  Xpa  and  bit  ^j=b.  After  updating  A,  we  recompute  the  mutual 
information  between  all  pairs  of  variables,  use  these  to  generate  another 
dependency  tree,  use  this  tree  to  generate  more  samples,  and  continue  the  cycle 
until  a  termination  condition  is  met.  High-level  pseudo-code  is  shown  in  Eigure  1 . 

The  values  of  A[Xj=a,  ^]=b]  at  the  beginning  of  an  iteration  may  be  thought  of 
as  specifying  a  prior  probability  distribution  over  “good”  bit-strings:  the  ratios  of 
the  values  within  A[Xj=a,  X^=b]  specify  the  distribution,  while  the  magnitudes  of 
these  values,  multiplied  by  a,  specify  an  “equivalent  sample  size”  reflecting  how 
confident  we  are  that  this  prior  probability  distribution  is  accurate.^ 

Unlike  the  algorithm  used  in  [De  Bonet,  et  ah,  1997],  we  do  not  maintain  a 
record  of  data  from  previous  generations.  This  eliminates  the  time  and  memory 
that  would  be  required  to  keep  a  large  number  of  old  bit-strings  sorted  according 
to  their  evaluations.  Our  method  is  also  closer  to  that  employed  by  PBIE,  which 
make  empirical  comparisons  with  PBIE  more  meaningful.  On  the  other  hand,  the 
relationship  between  the  entries  in  A[Xj=a,  ^]=b]  and  the  data  the  algorithm  has 


1 .  The  values  of  A  are,  in  some  respects,  similar  to  the  coefficients  of  Dirichlet  distributions  —  a  correspondence 
which  suggests  the  use  of  Bayesian  scoring  metrics  in  place  of  information-theoretic  ones  in  the  future,  such  as 
those  used  in  [Cooper  and  Herskovits,  1992]  and  [Heckerman,  et  ak,  1995]. 
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INITIALIZATION: 

For  all  bits  i  and  j  and  all  binary  assignments  to  a  and  b,  initialize  A  [Xpa,  Xj=fc]  to  Cjjiit. 

MAIN  LOOP:  Repeat  until  Termination  Condition  is  Met 

1 .  Generate  a  dependency  tree 

-Set  the  root  to  an  arbitrary  bit  X^^^j 

-For  all  other  bits  X;,  set  bestMatchingBitlnTreeiXJ  to  X^gt. 

-While  not  all  bits  have  been  added  to  the  tree: 

-Out  of  all  the  hits  not  yet  in  the  tree,  pick  the  hit  with  the  maximum 

mutual  information  IlX^dd,  hestMatchingBitInTree[Xj,jj]),  using  A  to  estimate 
the  relevant  probability  distributions. 

-Add  to  the  tree,  with  hestMatchingBitInTree[Xj,ji(i]  as  its  parent. 

-For  each  hit  X^^  still  not  in  the  tree,  compare  ICX^^,  bestMatchingBitInTree[XQm])  with 
I(Xout ,  X^dd)-  If  I(Xout.  Xadd)  is  greater,  set  hestMatchingBitInTree[Xom]  =  X^dd- 

2.  Generate  K  bit-strings  based  on  the  joint  probability  encoded  by  the  dependency  tree 
generated  in  the  previous  step.  Evaluate  these  bit-strings. 

3.  Multiply  all  of  the  entries  in  A  by  a  decay  factor  CX  between  0  and  1. 

4.  Choose  the  best  M  of  the  K  bit-strings  generated  in  step  2. 

For  each  bit-string  S  of  these  M,  add  1.0  to  every  A[X;=a,  Xj=/?]  such  that  S  has  Xj=a  and  Xj  =b. 


Figure  1:  Basic  Algorithm.  For  each  bit  not  yet  in  the  tree,  bestMatchingBitInTree[XQ^J 
maintains  a  pointer  to  a  bit  in  the  tree  with  which  X^^j  has  maximum  mutual  information. 


generated  in  the  past  is  less  clear.  In  this  paper,  we  compare  the  performance  of 
our  dependency  tree  algorithm  with  an  algorithm  similar  to  MIMIC  that  produces 
chain-shaped  dependency  graphs.  Both  algorithms  update  the  values  of 
A[Xj=a,  Xj=Z7]  in  the  manner  described  above.  In  future  work,  we  will  investigate 
the  effects  of  explicitly  maintaining  and  using  previously  generated  solutions. 

Example  dependency  graphs  shown  in  Figure  2  illustrate  the  types  of 
probability  models  learned  by  PBIL,  our  dependency  chain  algorithm,  and  our 
dependency  tree  algorithm.  We  use  Bayesian  network  notation  for  our  graphs:  an 
arrow  from  node  Xp  to  node  X,,  indicates  that  X^-’s  probability  distribution  is 
conditionally  dependent  on  the  value  of  Xp.  These  models  were  learned  while 
optimizing  a  noisy  version  of  a  two-color  graph  coloring  problem  in  which  there 
is  a  0.5  probability  of  adding  1  to  the  evaluation  function  for  every  edge 
constraint  satisfied  by  the  candidate  solution. 
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Figure  2:  A:  A  noisy  two-color  graph  coloring  problem.  B:  the  empty  dependency  graph 
used  by  PBIL.  C:  the  graph  learned  by  our  implementation  of  the  dependency  chain 
algorithm.  D;  the  graph  learned  by  our  dependency  tree  algorithm. 


3.  EMPIRICAL  COMPARISONS 

In  this  section,  we  present  an  empirical  comparison  of  four  algorithms  on  five 
classes  of  problems.  For  each  problem,  each  algorithm  was  allowed  2000 
generations,  with  200  evaluations  per  generation.  All  algorithms  were  run 
multiple  times,  as  specified  with  the  problem  description.  To  measure  the 
significance  of  the  differences  between  the  Tree  and  Chain  algorithms,  the  Mann- 
Whitney  rank-sum  test  is  used.  This  is  a  non-parametric  equivalent  of  the 
standard  two-sample  Mest.  The  algorithms  compared  are  described  below: 

Trees:  This  is  an  implementation  of  the  algorithm  described  in  this  paper.  Cj^it 
is  set  to  1000,  and  the  decay  rate  a  is  set  to  0.99.  (Other  decay  rates  were 
empirically  tested  on  a  single  problem;  the  value  of  0.99  yielded  the  best  results 
for  both  Trees  and  Chains.)  K=200  samples  are  created  per  generation.  The  top 
2%  (M=4  samples)  are  used  to  update  the  statistics  (the  A  matrix).  All  parameters 
were  held  constant  through  all  of  the  runs. 

Chain:  This  algorithm  is  identical  to  the  Tree  algorithm,  except  the  dependency 
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graphs  are  restricted  to  chains,  the  type  of  dependency  graph  used  in  the  MIMIC 
algorithm.  All  of  the  other  parameters’  values  are  the  same  as  for  Trees,  and  were 
held  constant  in  all  of  the  runs. 

PBIL:  This  is  the  basic  version  of  the  PBIL  algorithm  described  in  [Baluja, 
1995].  The  algorithm  and  parameters  are  shown  in  Figure  3.  The  parameters  were 
only  changed  on  the  peaks  problems;  parameter  settings  for  the  peaks  problem 
were  taken  from  [Baluja  &  Caruana,  1995]. 


Initialization. 

for  i:=1  to  LENGTH  do  P[i]  =  0.5; 

while  (NOT  termination  condition) 

Generate  Samples. 

for  i:=1  to  SAMPLES  do 

sample_vectors[i]  :=  generate_sample_vector_according_to_probabilities  (P); 

evaluations[i]  :=  evaluate(sample_vectors[i]); 

best_vector  :=  find_vector_with_best_evaluation  (sample_vectors,  evaluations); 

Update  Towards  Best. 

for  i;=1  to  LENGTH  do 

P[i]  ;=  P[i]*(1.0-LR)  +  best_vector[i]*(LR); 

for  i;=1  to  LENGTH  do 

Mutate  Probability 

if  (random  (0..1)  <  MUTATION_RATE); 

Vector. 

direction  ;=  random  (0  or  1); 

P[i]  ;=  P[i]*(1.0-MUTATION_SHIFT)  +  direction*(MUTATION_SHIFT) 

PBIL:  USER  DEFINED  CONSTANTS  (Values  Used  in  this  Study): 

SAMPLES;  the  number  of  vectors  generated  before  update  of  the  probability  vector  (200). 

LR:  the  learning  rate,  how  fast  to  exploit  the  search  performed  (0.1 ). 

LENGTH:  the  number  of  bits  in  a  generated  vector  (problem  specific). 

MUTATION  RATE:  chances  of  mutation  (0.02),  MUTATION  SHIFT:  magnitude  of  mutation  (0.05). 

Figure  3:  PBIL  algorithm  for  a  binary  alphabet,  adapted  from  [Baluja  &  Caruana,  1995]. 


Genetic  Algorithm  (GA):  Except  when  noted  otherwise,  the  GA  used  in  this 
paper  had  the  following  parameters:  80%  crossover  rate  (the  chances  that  a 
crossover  occurs  with  two  parents;  if  crossover  does  not  occur,  the  parents  are 
copied  to  the  children  without  modification);  uniform  crossover  (in  each  child, 
there  is  a  50%  probability  of  inheriting  the  bit  value  from  parent  A,  50%  from 
parent  B  [Syswerda,  1989]),  mutation  rate  0.1%  (probability  of  mutating  each 
bit),  elitist  selection  (the  best  solution  from  generatioUg  replaces  the  worst 
solution  in  generatioUg+p  [Goldberg,  1989],  and  population  size  200.  The  GAs 
used  fitness  proportional  selection  (this  means  that  the  chances  of  selecting  a 
member  of  the  population  for  recombination  is  proportional  to  its  evaluation). 
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With  this  method  of  selection,  the  GA  is  the  only  algorithm  tested  which  is 
sensitive  to  the  magnitudes  of  the  evaluation  (all  the  other  algorithms  work  with 
ranks).  To  help  account  for  this,  the  lowest  evaluation  in  each  generation  is 
subtracted  from  all  the  evaluations  before  probabilities  of  selection  are 
calculated.  Other  rank-based  selection  metrics  were  also  explored;  however,  they 
did  not  reveal  consistently  better  results.  Of  all  the  algorithms,  the  most  effort 
was  spent  tuning  the  GA’s  parameters.  Additionally,  for  many  problems  multiple 
GAs  were  attempted  with  many  different  parameter  settings.  For  these  problems, 
the  performance  for  several  GAs  are  given. 


3.1  Checkerboard 

This  problem  was  originally  suggested  by  [Boyan,  1993].  The  object  is  to  create 
a  checkerboard  pattern  of  O’s  and  I’s  in  an  NxN  grid.  Only  the  primary  four 
directions  are  considered  in  the  evaluation.  For  each  bit  in  an  (N-2)(N-2)  grid 
centered  in  an  NxN  grid,  -i-l  is  added  for  each  of  the  four  neighbors  that  are  set  to 
the  opposite  value.  The  maximum  evaluation  for  the  function  is  (N-2)(N-2)*4.  In 
these  experiments  N=16,  so  the  maximum  evaluation  is  784.  30  trials  were 
conducted  for  each  algorithm.  The  distribution  of  results  are  shown  in  Figure  4. 


Figure  4:  Distribution  of  results  for  the  Checkerboard  Problem. 
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3.2  Tree-Max 


In  this  problem  set,  we  randomly  generate  probability  distributions  of  the  form 
shown  in  Equation  1.  For  a  given  problem,  a  single  tree-shaped  network  is 
generated  and  the  probabilities  associated  with  the  nodes  in  these  networks  are 
randomly  generated.  The  value  of  a  given  bit-string  is  the  probability  with  which 
the  randomly  generated  network  would  generate  that  bit- string;  the  task  is  find  a 
bit-string  that  maximizes  this  value.  This  problem  is  designed  to  test  the  extent  to 
which  other  algorithms  can  capture  the  same  statistics  for  which  our  tree 
algorithm  is  specifically  designed.  We  show  results  for  three  different  classes  of 
problems:  in  the  first,  the  probabilities  associated  with  the  nodes  are  chosen 
uniformly  between  0  and  1 ;  in  the  second,  all  probabilities  are  chosen  to  lie  either 
between  0  and  0.2  or  between  0.8  and  1.0;  in  the  third,  all  probabilities  are 
chosen  uniformly  between  0.4  and  0.6.  200  problems  from  each  of  the  three 
distributions  were  tried.  The  results,  in  Figure  5,  show  that  Chains  can  capture 
many  of  the  dependencies  which  Trees  capture,  but  PBIF  and  GA  cannot. 


0 

1 _ 2^ _ 3 _ 4 

Figure  5:  Results  for  Tree-Max. 

For  each  problem,  the  relative  rank  of  the 
four  algorithms  was  computed  (l=Best, 

4= Worst).  All  members  of  a  tie  are 
assigned  the  value  of  the  highest  rank.  A 
histogram  of  ranks  is  presented.  For 
example,  in  (A)  Trees  ranked  first  154 
times.  Chains  131  times,  GAs  ranked  hrst 
56  times,  and  last  106  times.  (Note  that  on 
many  of  these  problems,  there  were  many 
ties).  (A)  Uniform  Distribution.  (B) 
Probabilities  between  0.0-0. 2  and  0. 8-1.0. 
(C)  Probabilities  between  0.4-0. 6. 


~  TREE 


1 _ 2 _ 3 _ 4 
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3.3  The  Peaks  Set  of  Problems 


This  set  of  three  problems  is  based  on  the  four-peaks  problem,  which  was 
originally  presented  in  [Baluja  and  Caruana,  1995].  The  original  four-peaks 
problem  is  defined  as  follows.  Given  an  input  vector  X,  which  is  composed  of  N 
binary  elements,  maximize  the  following: 

FourPeaks{T,  X)  =  MAX{head{l,  X),  tail{Q,  X))  +  Reward{T,  X) 

1  too  if  {head{l,  X)  >  T)  A  {tail{Q,  X)  >  T) 

Reward{T,  X)  =  ] 

[  0  otherwise 

head{b,  x)  =  number  of  contiguous  leading  bits  set  to  b  in  X 
tail(b,x)  =  number  of  contiguous  trailing  bits  set  to  b  in  X 


Fitness  is  maximized  if  a  string  is  able  to  get  both  the  REWARD  of  100  and  if 
the  length  of  one  of  head(l,X)  or  tail(0,X)  is  as  large  as  possible.  The  four  peaks 
problems  also  have  two  suboptimal  local  optima  with  fitnesses  of  N  (independent 
of  T).  One  of  these  is  at  tail(0,X)=N,  head(l,X)=0  and  the  other  is  at  tail(0,X)=0, 
head(l,X)=N.  Hill-climbing  will  quickly  get  trapped  in  these  local  optima.  For 
hill-climbing  to  work  well  here,  it  must  repeatedly  make  “correct”  decisions 
while  searching  large  plateaus;  this  is  extremely  unlikely  in  practice.  By 
increasing  T,  the  basins  of  attraction  surrounding  the  inferior  local  optima 
increase  in  size  exponentially  while  the  basins  around  the  global  optima  decrease 
at  the  same  rate. 

A  version  of  the  modification  to  the  four-peaks  problem  suggested  by  [De 
Bonet  et.  al,  1997]  is  tried  here.  In  this  problem,  “Six-Peaks”,  two  more  peaks  are 
added.  Optimization  methods  may  oscillate  between  multiple  answers,  since  the 
two  optimal  solutions  require  opposite  values  in  many  of  the  bit  positions. 


SixPeaks{T,  X)  =  MAX{head{Xf^,  X),  tail{—iXQ,  X))  +  Reward{T,  X) 


Reward{T,  X)  = 


if  {head(XQ,  X)>T)a  {tail{^XQ,  X)  >  T) 
otherwise 


The  final  version  of  the  peaks  problem  which  was  examined  was  the 
“Continuous-Peaks”  problem.  Rather  than  forcing  O’s  and  I’s  to  be  at  opposite 
ends  of  the  solution  string,  they  are  allowed  to  form  anywhere  in  the  string.  For 
this  problem,  a  reward  is  given  when  there  are  greater  than  T  contiguous  bits  set 
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Four  Peaks 


Four  Peaks  -  Performance  of  Different  GAs 


Continuous  Peaks 


T  Parameter 


Figure  6:  Number  of  runs  (out  of  100)  that  the  reward  was  found  (A)  Four  Peaks,  (C)  Six  Peaks, 
(D)  Continuous  Peaks.  In  (B)  the  performance  of  the  different  GAs  is  measured  on  Four  Peaks. 


to  0,  and  greater  than  T  contiguous  bits  set  to  1 . 

Since  the  four-peaks  problem  has  previously  been  studied,  we  used  the  same 
parameter  settings  as  in  the  previous  studies  -  PBIL  (mutation  rate  =  0.0,  update 
from  the  top  two  vectors).  For  the  GA,  the  previously  used  settings  did  not  work 
as  well  (perhaps  since  the  total  bit  length  is  less  than  the  previously  studied 
version),  therefore,  four  GAs  were  experimented  with  here.  All  used  one-point 
crossover  [Goldberg,  1989]  instead  of  uniform  crossover,  since  these  problems 
were  custom-designed  to  benefit  from  this  operator.  GAl  used  a  population  size 
of  200,  with  mutation  rate  0.001;  GA2  used  a  population  size  of  200,  with 
mutation  rate  0.01;  GAS  used  a  population  size  of  500  with  mutation  rate  0.001; 
and  GA4  used  a  population  size  of  500  with  mutation  rate  0.01.  GA4  performed 
the  best,  so  all  the  GA  results  are  reported  with  it  in  all  of  the  problems.  Note  that 
since  each  algorithm  was  allowed  the  same  number  of  generations,  GAS  &  GA4 
used  500*2000  evaluations,  while  all  other  algorithms  used  200*2000. 
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3.4  Real- Valued  Functions 


It  is  common  to  use  the  types  of  optimization  algorithms  discussed  in  this  paper 
on  real-valued  functions  which  have  been  discretized  to  an  arbitrary  precision.  In 
this  section,  two  multi-dimensional  optimization  problems  are  tested.  Both  of  the 
problems  were  chosen  because  none  of  the  parameters  can  be  set  independently. 

Problem  1:  Summation  Cancellation 

In  this  problem,  the  parameters  (s^,  ...,  s^)  in  the  beginning  of  the  solution  string 
have  a  large  influence  on  the  quality  of  the  solution.  The  goal  is  to  minimize  the 
magnitudes  of  cumulative  sums  of  the  parameters.  Small  changes  in  the  first 
parameters  can  cause  large  changes  in  the  evaluation.  The  evaluation  is: 


-0.16<s,.<0.15 
i  =  1...N 


yi  =  h 


yi  =  ^i  +  yi-i 

i  =  2...N 


C  = 


1 


100000 


/  = 


1.0 


N 

c  + 

IN 

1 

To  test  the  effects  of  increasing  the  dimensionality  of  the  problem,  we  tried 
varying  the  dimensions  (N)  between  10  and  30.  For  each  number  of  dimensions, 
100  problems  runs  were  simulated.  In  all  of  these  problems,  each  parameter  was 
represented  with  5  bits,  encoded  in  standard  base-2,  with  the  values  uniformly 
spaced  between  -0.16  and  0.15.  The  results  are  shown  in  Table  I. 


Table  I:  Summation  Cancellation-  Standard  Binary  Encoding.  Average  value  of  best  solution 
found  over  100  runs  of  2000  generations  each.  (Goal:  Maximize  Value) 


Parameters 

Bits 

Tree 

Chain 

95%  Significance 
Tree/Chain 

PBIL 

GA 

10 

50 

53.7 

54.1 

Yes 

TTT) 

12 

60 

29.3 

24.1 

Yes 

16.1 

9.3 

15 

75 

16.8 

16.9 

No 

11.2 

5.8 

17 

85 

13.8 

13.7 

No 

9.5 

4.7 

20 

11.0 

10.9 

No 

7.5 

3.3 

23 

8.5 

6.4 

Yes 

6.0 

2.4 

25 

125 

6.3 

4.2 

Yes 

5.0 

2.2 

27 

135 

4.2 

3.0 

Yes 

4.4 

1.9 

30 

150 

2.6 

1.9 

Yes 

3.6 

1.6 

When  using  algorithms  which  operate  in  a  binary  alphabet,  numerical 
parameters  are  often  encoded  in  Gray  code.  Gray  code  avoids  the  Hamming  cliffs 
which  can  be  present  between  consecutive  numbers  when  encoded  in  standard 
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base-2  representation.  We  repeated  the  experiments  above  (again  with  100  trials 
per  dimension  setting)  using  Gray  coding.  The  results  are  presented  in  Table  II. 
Because  the  standard  GA  performed  poorly,  and  GA-based  optimization  often 
benefits  from  higher  mutation  rates  when  the  parameters  are  encoded  in  Gray 
code,  the  parameters  were  hand-tuned.  The  second  GA  used,  GA-2,  has  a 
mutation  rate  20  times  that  of  the  original  GA.  Note  that  because  of  the  problem 
specification,  small  changes  in  the  sum  in  the  denominator  of  the  evaluation 
function  can  lead  to  enormous  differences  in  evaluation. 


Table  II:  Summation  Cancellation-  Gray  Encoding.  Average  value  of  best  solution  found  over 
100  runs  of  2000  generations  each.  (Goal:  Maximize  Value) 


Parameters 

Tree 

Chain 

95%  Significance 
Tree/Chain 

PBIL 

GA 

(mut=0.001) 

GA2 

(mut  =0.02) 

lii 

No 

ujjjuum 

12 

42053.3 

54041.8 

No 

100000 

2020.4 

100000 

15 

4047 

No 

100000 

100000 

17 

26.0 

26.7 

Yes 

98001.5 

100000 

20 

14.4 

13.7 

Yes 

94005 

5.1 

93001.4 

23 

8.1 

7.6 

Yes 

90008.7 

3.7 

10011.8 

25 

5.0 

5.3 

Yes 

2.9 

6.0 

27 

4.4 

4.0 

Yes 

2.9 

3.4 

30 

3.0 

2.9 

Yes 

32039.3 

2.2 

2.2 

Problem  2:  Solving  a  System  of  Linear  Equations 

The  goal  of  this  problem  is  to  solve  for  X  in  DX=B,  when  given  a  matrix  D  and 
vector  B.  Although  there  are  many  standard  techniques  for  solving  this  problem, 
it  also  serves  as  a  good  benchmark  for  combinatorial  optimization  algorithms 
[Eshelman  et.  al,  1996].  Since  all  of  the  parameters  are  dependent  on  each  other, 
optimization  is  difficult.  In  this  study,  D  is  a  9x9  matrix,  B  and  X  are  vectors  of 
length  9.  Each  value  in  X  is  an  integer  between  [-256,  255].  The  solution 
encoding  is  81  bits.  D  is  randomly  generated,  and  B  set  such  that  there  is 
guaranteed  to  be  a  solution  to  DX=B.  The  results  for  the  algorithms  are  presented 
in  Table  III;  a  total  of  9  problems  were  tried  with  25  runs  per  problem.  The  goal 
is  to  minimize  the  summed  parameter-by-parameter  absolute  difference  of  DX 
and  B.  Note  that  the  GAs  used  in  the  previous  experiments  (GAl  &  GA2)  did  not 
perform  well  here. 
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Table  III:  System  of  Linear  Equations:  Average  value  of  best  solution  found  over  25  runs  of 
2000  generations  each.  (Goal:  Minimize  Error) 


Standard  Base-2  Encoding 

Gray  Coding 

Tree 

Chain 

95%Sig. 

Tree/ 

Chain 

PBIL 

GA 

Trees 

Chain 

95%Sig. 

Tree/ 

Chain 

PBIL 

GA 

(mat 

.001) 

GA2 

(mat 

0.02) 

GA3 

(mut 

.005) 

648 

No 

No 

mil 

865 

721 

1537 

Yes 

799 

No 

1246 

2662 

955 

544 

Yes 

2981 

4055 

387 

346 

No 

1011 

1524 

1909 

785 

1347 

Yes 

3204 

4268 

830 

911 

No 

2120 

1628 

2218 

990 

848 

1255 

Yes 

851 

Yes 

1985 

1253 

2265 

834 

313 

393 

Yes 

335 

No 

621 

1205 

2190 

720 

1034 

Yes 

3331 

3805 

809 

723 

No 

1987 

1769 

2132 

1066 

1029 

Yes 

2996 

3965 

684 

641 

No 

1753 

1311 

2584 

934 

577 

904 

No 

3135 

3821 

333 

308 

No 

1209 

1732 

2511 

777 

3.5  Equal  Products 

In  this  problem,  there  are  N  randomly  selected  real  valued  numbers,  each  in  the 
range  (0..5).  The  object  is  to  select  a  set  of  these  numbers  such  that 
Y\{selected{X))  =  Y\{notSelected(X)).  The  evaluation  function,  which  must  be 

minimized,  is  the  absolute  difference  between  the  products. 

To  test  the  robustness  of  algorithms 
as  the  problem  size  changes, 
problems  with  N=  40,50,60,70,80, 
and  90  numbers  were  tried.  For 
each  value  of  N,  16  problems  were 
tried,  with  100  runs  per  problem. 

Because  of  the  large  number  of 
trails,  only  the  chain  and  tree 
algorithms  were  attempted.  The 
chain  version  did  better  on  the 
smaller  problems,  with  the  tree  version  doing  better  on  the  larger  problems.  In 
Table  IV,  the  “Tree  Better”  column  shows  how  many  times  (out  of  16)  the  Tree 
algorithm  outperformed  Chains.  The  “signif  (>95)”  column  shows  the  number  of 
problems  in  which  the  difference  between  the  chains  and  trees  was  significant  to 
95%  (using  the  Mann- Whitney  Test).  The  same  statistics  are  shown  for  Chains. 


Table  IV:  Results  for  Equal  Multiplications 


N 

Tree 

Better 

Signif. 

(>  95)? 

Chain 

Better 

Signif. 

(>  95)? 

40 

0/16 

n/a 

16/16 

16/16 

50 

8/16 

o 

— 

00 

8/16 

0/8 

60 

15/16 

5/15 

1/16 

0/1 

70 

14/16 

6/14 

1/16 

0/1 

80 

15/16 

1/15 

2/16 

0/2 

90 

14/16 

4/14 

2/16 

0/2 
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3.6  Summary 


In  this  section,  we  have  presented  a  large  number  of  empirical  results.  Five  sets 
of  problems  were  tried,  using  four  algorithms  (Trees,  Chains,  PBIL  &  GAs).  In 
almost  all  of  the  problems,  the  Tree  and  Chain  algorithms  outperformed  both 
PBIL  and  GAs.  The  exception  to  this  was  the  “Summation  Cancellation” 
problem  utilizing  Gray  encodings.  As  can  be  seen  from  the  results  with  the 
varieties  of  GAs  used  on  the  Gray-Code  problems,  the  frequency  with  which 
mutation  is  applied  can  have  a  large  impact  on  the  performance  of  the  algorithm. 
The  version  of  the  Tree  and  Chain  algorithms  used  in  this  study  did  not  have  any 
form  of  mutation;  however,  they  could  easily  be  extended  to  include  mutation¬ 
like  operations. 

Between  the  Tree  and  Chain  algorithms,  the  Tree  algorithm  performed  at  least 
as  well  as  the  Chain  algorithm  on  almost  every  problem  examined,  and  often 
performed  significantly  better.  The  notable  exception  to  this  is  the  small  versions 
of  “Equal  Products”  problems,  in  which  the  Chains  performed  better.  We  are 
currently  analyzing  what  features  of  this  problem  cause  Chains  to  perform  better. 

Perhaps  the  most  important  result  to  notice  is  that  in  most  cases  examined,  the 
performance  of  the  combinatorial  optimization  algorithms  consistently  improved 
as  the  accuracy  of  their  statistical  models  increased. 


4.  CONCLUSIONS  &  FUTURE  DIRECTIONS 

We  have  shown  that  using  incrementally  learned  second-order  probabilities  to 
generate  optimal  tree- structured  probabilistic  networks  can  significantly  improve 
the  performance  of  combinatorial  optimization  algorithms.  Two  obvious 
questions  arise:  could  these  results  be  extended  to  handle  higher-order 
dependencies,  and  would  modeling  such  dependencies  result  in  more  effective 
combinatorial  optimization  algorithms? 

Suppose  we  are  given  a  set  of  variables  X;  a  database  D={Ci,  ...,  Cj^^},  where 
each  Cj  is  an  instance  of  all  variables  in  X;  a  scoring  metric  M(D,  B)  which 
evaluates  how  well  a  network  structure  B  models  the  dependencies  in  the  data  D; 
and  a  real  value  p.  The  k- LEARN  problem  may  be  described  as  follows:  does 
there  exist  a  network  structure  B  defined  over  the  variables  in  X,  where  each 
node  in  B  has  at  most  k  parents,  such  that  M(D,  B)  >  p? 

The  algorithm  we  have  used  in  this  paper  provides  a  solution  for  the  case  in 
which  k=l  and  M(D,  B)  is  the  likelihood  of  the  data  D  given  the  network  B. 
Unfortunately,  it  has  been  shown  that  k-LEARN  is  NP-complete  for  k  >  1  for  the 
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types  of  scoring  metrics  we  would  probably  wish  to  employ  [Chickering,  et  ah, 
1995].  However,  search  heuristics  for  finding  approximate  solutions  have  been 
developed  for  automatically  learning  Bayesian  networks  from  data  [Heckerman, 
et  ah,  1995].  A  common  approach  is  to  perform  hill-climbing  over  network 
structures,  starting  with  a  relatively  simple  network  -  often  the  optimal  tree¬ 
shaped  network  produced  by  algorithms  similar  to  the  one  presented  here. 

While  it  would  be  computationally  expensive,  it  would  be  interesting  to  see 
whether  learning  more  complex  network  structures  through  such  hill-climbing 
procedures  might  allow  combinatorial  optimization  algorithms  similar  to  PBIL, 
MIMIC,  and  the  algorithm  developed  in  this  paper  to  reach  better  solutions  with 
fewer  evaluations.  It  is  not  clear  a  priori  that  it  is  necessarily  desirable  to  use  a 
more  complex  model  of  the  “good”  bit-strings  generated  so  far:  the  more  tightly 
the  model  fits  the  old  “good”  data,  the  more  heavily  the  algorithm  will  wind  up 
concentrating  on  exploitation  rather  than  exploration.  Nonetheless,  we  feel  that 
experiments  with  more  flexible  probabilistic  models  would  be  invaluable. 

The  types  of  statistics  maintained  in  our  algorithm  can  be  used  to  combine  the 
outputs  from  other  methods  of  optimization.  For  example,  multiple  hill-climbing 
or  genetic  algorithm  runs  are  often  used  to  find  different  local  optima.  The  best 
solutions  from  multiple  runs  of  any  of  these  algorithms,  or  even  from  different 
algorithms,  can  be  used  to  collect  the  second-order  statistics  used  in  this  study. 
Once  these  statistics  are  collected,  and  the  dependency  graph  is  created,  the 
candidate  solutions  generated  by  our  algorithm  can  be  used  either  to  update  the 
statistics  used  by  our  algorithm,  or  to  re-initialize  the  other  algorithms  with  good 
starting  points.  If  the  second  method  is  employed,  rather  than  updating  the 
statistics  with  the  strings  generated  from  the  previous  model,  the  model  could  be 
updated  with  the  strings  returned  by  the  separate  search  procedure.  Somewhat 
similar  methods  have  been  explored  by  [Boyan  &  Moore,  1997]. 

Many  ideas  used  in  other  combinatorial  algorithms  can  easily  be  incorporated, 
such  as  mutation,  weighting  the  contribution  of  candidate  solutions  according  to 
their  evaluations,  and  explicitly  recording  and  using  old  solutions  to  model 
probability  distributions.  Additionally,  our  algorithm  can  easily  be  extended  to 
handle  variables  with  more  than  two  values.  We  are  also  currently  extending  it  to 
handle  real- valued  variables.  There  are  many  opportunities  here  for  exploiting 
recent  research,  such  as  [Geiger  and  Heckerman,  1994],  on  learning  Bayesian 
networks  for  real- valued  functions. 
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